Classification of News Stories Using Support Vector Machines
Abstract
Given a data set and a data mining task such as classification, there are two main reasons for performing feature space reduction. The first is to improve the accuracy of the algorithm. In a domain such as text mining, the common technique of parameterizing each document as a vector of words results in thousands of dimensions. The performance of many learning algorithms decreases as the dimensionality of the input space increases. Support Vector Machines (SVMs) [Vap95], which are based on Vapnik's statistical learning theory, can be used as a classification technique and have been shown by Joachims [Joa98] to be reasonably immune to the high dimensionality of text feature spaces. The second reason for performing feature space reduction is to decrease the overall size of the data set in order to conserve storage space and minimize the time required to handle the data and run the mining algorithms. Even with SVMs, very large data sets may warrant feature space reduction because of this second class of problems. This paper describes the results of an experiment to train SVMs to classify print, television, and radio news sources. Tests were performed to compare full text against feature space reduction using a natural language processing technique and reduction using information gain. The results show that while the size of the data set can be reduced by an order of magnitude with natural language processing, this reduction causes a significant loss in both recall and precision. However, both the precision and recall achieved with the SVMs trained on the full-text and information-gain representations were higher than those achieved with the K-nearest-neighbors algorithm. Also, three term weighting methods, TFIDF, TF, and binary, are compared for use with the SVMs. Representations with TFIDF and TF weights produced similar results, while the binary weighting method resulted in a significant loss in recall.
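The three term-weighting schemes compared in the abstract (binary, TF, and TFIDF) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the plain log-IDF formula, and the toy documents are assumptions made for the example:

```python
import math
from collections import Counter

def term_weights(docs, scheme="tfidf"):
    """Weight each document's terms under one of three schemes:
    'binary' (term presence), 'tf' (raw term count), or
    'tfidf' (term count scaled by log inverse document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "binary":
            w = {t: 1.0 for t in tf}
        elif scheme == "tf":
            w = {t: float(c) for t, c in tf.items()}
        else:  # tfidf: terms appearing in every document get weight 0
            w = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        weights.append(w)
    return weights

# Toy corpus: "svm" occurs in both documents, so its TFIDF weight is 0.
docs = [["svm", "news", "text"], ["svm", "svm", "radio"]]
print(term_weights(docs, "tfidf"))
```

Note how the binary scheme discards term-frequency information entirely, which is consistent with the abstract's finding that it loses recall relative to TF and TFIDF.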
Related Papers
Transductive Inference for Text Classification using Support Vector Machines
This paper introduces Transductive Support Vector Machines (TSVMs) for text classification. While regular Support Vector Machines (SVMs) try to induce a general decision function for a learning task, Transductive Support Vector Machines take into account a particular test set and try to minimize misclassifications of just those particular examples. The paper presents an analysis of why TSVMs are ...
Support Vector Learning for Fuzzy Rule-Based Classification Systems
To design a fuzzy rule-based classification system (fuzzy classifier) with good generalization ability in a high dimensional feature space has been an active research topic for a long time. As a powerful machine learning approach for pattern recognition problems, the support vector machine (SVM) is known to have good generalization ability. More importantly, an SVM can work very well on a high (or e...
Maximal Margin Classification using the
Support Vector Machines have been successfully used in a number of applications such as the recognition of digits in postal codes and face recognition. However, their conventional implementation involves the use of quadratic programming techniques which are slow and not easy to implement. We outline a simple learning procedure for finding the same maximal margin hyperplane in the feature space f...
Hierarchical Classification and Feature Reduction for Fast Face Detection with Support Vector Machines
We present a two-step method to speed up object detection systems in computer vision that use support vector machines as classifiers. In the first step we build a hierarchy of classifiers. On the bottom level, a simple and fast linear classifier analyzes the whole image and rejects large parts of the background. On the top level, a slower but more accurate classifier performs the final detection. We pr...
Margin Maximizing Loss Functions
Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition fo...
Publication date: 1999